Capstone Project

IBM Data Science Professional Certificate

New Bakery Business Inspection in Toronto, Ontario, Canada



1. Description of the problem and a discussion of the background.

Toronto is the capital city of the Canadian province of Ontario. With a recorded population of approximately 2.7 million in 2016, it is the most populous city in Canada and the fourth most populous city in North America. The diverse population of Toronto reflects its current and historical role as an important destination for immigrants to Canada. More than 50 percent of residents belong to a visible minority population group, and over 200 distinct ethnic origins are represented among its inhabitants. Toronto is an international centre of business, finance, arts, and culture, and is recognized as one of the most multicultural and cosmopolitan cities in the world. Toronto covers an area of 630 square kilometres (243 sq mi), with a maximum north–south distance of 21 km (13 mi). It has a maximum east–west distance of 43 km (27 mi) and it has a 46-kilometre (29 mi) long waterfront shoreline, on the northwestern shore of Lake Ontario.

Toronto encompasses a geographical area formerly administered by many separate municipalities. These municipalities have each developed a distinct history and identity over the years, and their names remain in common use among Torontonians. Former municipalities include East York, Etobicoke, Forest Hill, Mimico, North York, Parkdale, Scarborough, Swansea, Weston and York. Throughout the city there exist hundreds of small neighbourhoods and some larger neighbourhoods covering a few square kilometres.

The objective of this project is to analyze and select the best locations in the city of Toronto, Canada to open a new bakery. Using data science methodology and tools such as data analysis and data visualization, the project aims to provide new insights into this business problem.

2. Description of the data and how it will be used to solve the problem.

For this research we will use the following data:

  • postal codes, boroughs, and neighborhoods of Toronto, Canada
  • latitude and longitude coordinates of the postal codes
  • venue data on bakeries and pastry shops

Data sources:

  • postal codes, boroughs, and neighborhoods of Toronto, Canada: Wikipedia
  • latitude and longitude coordinates of the postal codes: Google Maps API
  • venue data on bakeries and pastry shops: Foursquare API

3. Data wrangling

Preparation

In [26]:
# import libraries for data
import pandas as pd
import numpy as np
In [27]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim 

# map rendering library
#pip install folium # uncomment this line to install Folium
import folium 

# library to handle requests
import requests
In [28]:
# defining Toronto coordinate to initiate map later on
toronto_geolocator = Nominatim(user_agent="toronto_explorer")

toronto_address = 'Toronto, Ontario'
toronto_location = toronto_geolocator.geocode(toronto_address)

toronto_latitude = toronto_location.latitude
toronto_longitude = toronto_location.longitude

print('The geographical coordinates of Toronto are {}, {}.'.format(toronto_latitude, toronto_longitude))
The geographical coordinates of Toronto are 43.6534817, -79.3839347.

Import data

In [29]:
# import JSON file with Toronto venues from previous task
table = pd.read_json(r'toronto_venues.json')
table.head()
Out[29]:
Postal Code Postal Code Latitude Postal Code Longitude Venue Venue Latitude Venue Longitude Venue Category
0 M3A 43.753259 -79.329656 Allwyn's Bakery 43.759840 -79.324719 Caribbean Restaurant
1 M3A 43.753259 -79.329656 Donalda Golf & Country Club 43.752816 -79.342741 Golf Course
2 M3A 43.753259 -79.329656 Brookbanks Park 43.751976 -79.332140 Park
3 M3A 43.753259 -79.329656 Tim Hortons 43.760668 -79.326368 Café
4 M3A 43.753259 -79.329656 LCBO 43.757774 -79.314257 Liquor Store

Data examination

We need to segment all the venues to find our competitors, so the next step is to filter out all the bakeries. After exploring all the venue categories, we will also count the following as competitors: bagel shop, creperie, cupcake shop, donut shop, pastry shop, pie shop, and sandwich place. Picking the categories is manual work, but it has to be done.

In [30]:
# making list of competitors
competitors = ['Bakery', 
               'Bagel Shop', 
               'Creperie', 
               'Cupcake Shop', 
               'Donut Shop', 
               'Pastry Shop', 
               'Pie Shop',
               'Sandwich Place']

Let's keep only the venues that fall into the competitor categories

In [31]:
# filtering by list of competitors
toronto_venues = table.copy()
toronto_venues_competitors = toronto_venues[toronto_venues['Venue Category'].isin(competitors)]
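
To make the filtering step concrete, here is a minimal sketch on a toy table. The first row is taken from the data preview above; the other rows and the names toy_venues and toy_competitors are made up for illustration. Note that the match is on 'Venue Category', not on the venue name.

```python
import pandas as pd

# Toy venue table with the same column names as the notebook.
# Only the first row comes from the real data; the rest is hypothetical.
toy_venues = pd.DataFrame({
    'Postal Code': ['M3A', 'M1A', 'M1A', 'M1B'],
    'Venue': ["Allwyn's Bakery", 'Corner Bakery', 'City Park', 'Donut Stop'],
    'Venue Category': ['Caribbean Restaurant', 'Bakery', 'Park', 'Donut Shop'],
})
toy_competitors = ['Bakery', 'Donut Shop']

# Boolean mask: True where the category is one of the competitor categories
is_competitor = toy_venues['Venue Category'].isin(toy_competitors)
toy_competitor_venues = toy_venues[is_competitor]

print(toy_competitor_venues['Venue'].tolist())
# ['Corner Bakery', 'Donut Stop']
```

Allwyn's Bakery is not selected because its category is 'Caribbean Restaurant', so the quality of the competitor list depends on how the venues were categorized.
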

Let's visualize all the competitors on the map

In [32]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=14)

# adding competitors to the map
for lat, lng, category, venue in zip(toronto_venues_competitors['Venue Latitude'], 
                                           toronto_venues_competitors['Venue Longitude'], 
                                           toronto_venues_competitors['Venue Category'], 
                                           toronto_venues_competitors['Venue']):
    label = '{}, {}'.format(category, venue)
    label = folium.Popup(label, parse_html=True)
    folium.Marker(
        [lat, lng],
        popup=label,
        parse_html=False).add_to(map_toronto)

map_toronto
Out[32]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Data transformation

To proceed further, let's count the venues in each postal code. This helps us identify the busier areas, where most of the traffic is.

In [33]:
# count venues in every Postal Code
toronto_venues_count = toronto_venues.groupby('Postal Code').count().sort_values(by='Venue', ascending=False)

# count competitors in every Postal Code
toronto_venues_competitors_count = toronto_venues_competitors.groupby('Postal Code').count().sort_values(by='Venue', ascending=False)

# adding number of competitors to previous table
toronto_venues_count['Number of Competitors'] = toronto_venues_competitors_count['Venue']

# replace NaN with 0 (postal codes without competitors) and
# store the column as integers for consistency
toronto_venues_count['Number of Competitors'] = toronto_venues_count['Number of Competitors'].fillna(0).astype(int)
toronto_venues_count['Percent of Competitors'] = round(toronto_venues_count['Number of Competitors'] / toronto_venues_count['Venue'] * 100, 2)

# delete unnecessary columns
toronto_venues_count = toronto_venues_count[['Venue', 'Number of Competitors', 'Percent of Competitors']]

print(toronto_venues_count.shape)
toronto_venues_count.head()
(103, 3)
Out[33]:
Venue Number of Competitors Percent of Competitors
Postal Code
M4X 100 6 6.0
M7Y 100 5 5.0
M6R 100 5 5.0
M4L 100 5 5.0
M4M 100 4 4.0
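
The arithmetic behind this table can be sketched on a toy example. The rows below are invented, and the use of size() with reindex() is just one possible way to get the same three columns, not necessarily how the table above was built.

```python
import pandas as pd

# Hypothetical venues: M1A has 4 venues (2 competitors), M1B has 2 venues (0 competitors)
toy = pd.DataFrame({
    'Postal Code': ['M1A', 'M1A', 'M1A', 'M1A', 'M1B', 'M1B'],
    'Venue Category': ['Bakery', 'Park', 'Coffee Shop', 'Donut Shop', 'Park', 'Gym'],
})
toy_competitors = ['Bakery', 'Donut Shop']

# total venues per postal code
total = toy.groupby('Postal Code').size()

# competitors per postal code; postal codes without competitors get 0
comp = (toy[toy['Venue Category'].isin(toy_competitors)]
        .groupby('Postal Code').size()
        .reindex(total.index, fill_value=0))

summary = pd.DataFrame({'Venue': total, 'Number of Competitors': comp})
summary['Percent of Competitors'] = (
    summary['Number of Competitors'] / summary['Venue'] * 100
).round(2)

print(summary)
```

For M1A this gives 2 / 4 * 100 = 50.0 percent, and for M1B 0.0 percent.
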

One-hot encoding

In [34]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
toronto_onehot['Postal Code'] = toronto_venues['Postal Code']

# add Postal Code as first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

# drop each of competitors to clean dataframe
toronto_onehot.drop(competitors, axis=1, inplace=True)
print(toronto_onehot.shape)
(6894, 339)
In [35]:
# group rows by mean of the frequency of occurrence of each category
toronto_grouped = toronto_onehot.groupby('Postal Code').mean().reset_index()

# set index and add columns about competitors
toronto_grouped.set_index('Postal Code', inplace=True)

toronto_grouped['Number of Competitors'] = toronto_venues_count['Number of Competitors']
toronto_grouped['Percent of Competitors'] = toronto_venues_count['Percent of Competitors']

print(toronto_grouped.shape)
(103, 340)
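
To see what the encoding and the group mean produce, here is a small sketch on invented rows (the toy table and its values are assumptions for illustration). After grouping, each cell is the share of a postal code's venues that belong to that category, so every row sums to 1.

```python
import pandas as pd

# Hypothetical venues for two postal codes
toy = pd.DataFrame({
    'Postal Code': ['M1A', 'M1A', 'M1B', 'M1B', 'M1B', 'M1B'],
    'Venue Category': ['Park', 'Coffee Shop', 'Park', 'Park', 'Gym', 'Coffee Shop'],
})

# one column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(toy['Venue Category'], dtype=float)
onehot.insert(0, 'Postal Code', toy['Postal Code'])

# mean of the 0/1 columns = frequency of each category per postal code
freq = onehot.groupby('Postal Code').mean()
print(freq)
```

In this toy case M1B has two parks out of four venues, so its 'Park' frequency is 0.5.
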
In [36]:
# function to sort venues in descending order
def return_most_common_venues(row, num_top_venues):
    # the row holds only category frequencies ('Postal Code' is the index),
    # so no leading column has to be skipped
    row_categories_sorted = row.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
In [37]:
# making dataframe with top-5 venues for each postal code
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Postal Code']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # no special suffix left after 'st', 'nd', 'rd', so fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Postal Code'] = toronto_grouped.index

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.drop(['Percent of Competitors', 
                                                                                                'Number of Competitors'], axis=1).iloc[ind, :], num_top_venues)
    
neighborhoods_venues_sorted.head()
Out[37]:
Postal Code 1st Most Common Venue 2nd Most Common Venue 3rd Most Common Venue 4th Most Common Venue 5th Most Common Venue
0 M1B Zoo Exhibit Fast Food Restaurant Pizza Place Restaurant Coffee Shop
1 M1C Park Breakfast Spot Neighborhood Gym / Fitness Center Gym
2 M1E Pizza Place Breakfast Spot Bank Fast Food Restaurant Coffee Shop
3 M1G Coffee Shop Pharmacy Pizza Place Fast Food Restaurant Supermarket
4 M1H Coffee Shop Indian Restaurant Restaurant Gas Station Clothing Store
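
The ranking step can be sketched on a tiny frequency table. The numbers and the helper name top_categories are hypothetical; the idea is simply to sort each row in descending order and keep the first n labels.

```python
import pandas as pd

# Hypothetical category frequencies per postal code (each row sums to 1)
freq = pd.DataFrame(
    {'Coffee Shop': [0.6, 0.2], 'Gym': [0.1, 0.3], 'Park': [0.3, 0.5]},
    index=pd.Index(['M1A', 'M1B'], name='Postal Code'),
)

def top_categories(row, n):
    """Return the n category names with the highest frequency in this row."""
    return row.sort_values(ascending=False).index[:n].tolist()

top2 = {code: top_categories(row, 2) for code, row in freq.iterrows()}
print(top2)
# {'M1A': ['Coffee Shop', 'Park'], 'M1B': ['Park', 'Gym']}
```

When two categories have exactly the same frequency, their relative order is not guaranteed, which matters for postal codes with only a handful of venues.
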

4. K-Means clustering

In [38]:
import plotly.express as px
In [53]:
import plotly
plotly.offline.init_notebook_mode()

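To choose the number of clusters, let's use the 'elbow method': fit k-means for a range of cluster counts, plot the inertia (the within-cluster sum of squared distances) against the number of clusters, and pick the point where the curve starts to flatten.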

In [39]:
X = toronto_grouped.copy()

# calculate distortion for a range of number of cluster
distortions = []

for i in range(1, 11):
    km = KMeans(
        n_clusters=i, init='random',
        n_init=10, max_iter=300,
        tol=1e-04, random_state=0
    )
    km.fit(X)
    distortions.append(km.inertia_)

# plot
df_distortions = pd.DataFrame(distortions, columns=['distortions'])
df_distortions['clusters'] = range(1,11)
figure_5 = px.line(df_distortions, x="clusters", y="distortions")
figure_5.show()
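
The value stored in the distortions list is scikit-learn's inertia_: the sum of squared distances from each sample to its closest cluster centre. The sketch below recomputes that quantity by hand with NumPy on four invented points; the function name inertia and the chosen centroids are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    # pairwise differences, shape (n_points, n_centroids, n_features)
    diffs = points[:, None, :] - centroids[None, :, :]
    sq_dist = (diffs ** 2).sum(axis=2)
    # keep the distance to the closest centroid for every point
    return sq_dist.min(axis=1).sum()

# two tight pairs of points that are far away from each other
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

one_centroid = points.mean(axis=0, keepdims=True)      # k = 1: centre at (5, 1)
two_centroids = np.array([[0.0, 1.0], [10.0, 1.0]])    # k = 2: one centre per pair

print(inertia(points, one_centroid))    # 104.0
print(inertia(points, two_centroids))   # 4.0
```

Going from one to two clusters drops the inertia from 104 to 4. Splitting further could only remove the remaining 4, and that flattening is the 'elbow' we look for in the plot.
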
In [40]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.copy()

# run k-means clustering (n_init is set explicitly because its default differs between scikit-learn versions)
kmeans = KMeans(n_clusters=kclusters, n_init=10, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_ 
Out[40]:
array([0, 0, 0, 4, 2, 1, 4, 2, 4, 0, 4, 4, 4, 4, 2, 1, 3, 2, 2, 0, 4, 0,
       4, 4, 1, 0, 4, 4, 0, 0, 0, 0, 0, 0, 0, 4, 2, 2, 1, 2, 2, 4, 2, 4,
       2, 0, 0, 2, 2, 2, 4, 2, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 1, 2, 2, 0,
       2, 2, 0, 0, 0, 0, 0, 2, 2, 2, 1, 2, 2, 4, 1, 4, 2, 2, 2, 0, 2, 2,
       4, 0, 0, 4, 4, 4, 4, 4, 4, 1, 1, 4, 4, 4, 4])
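
Before interpreting the clusters, it can help to check how many postal codes fall into each one. This is a small sketch that copies the label array printed above and counts the labels with np.bincount; the variable names are made up for illustration.

```python
import numpy as np

# labels copied from the output of kmeans.labels_ above
labels = np.array([
    0, 0, 0, 4, 2, 1, 4, 2, 4, 0, 4, 4, 4, 4, 2, 1, 3, 2, 2, 0, 4, 0,
    4, 4, 1, 0, 4, 4, 0, 0, 0, 0, 0, 0, 0, 4, 2, 2, 1, 2, 2, 4, 2, 4,
    2, 0, 0, 2, 2, 2, 4, 2, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 1, 2, 2, 0,
    2, 2, 0, 0, 0, 0, 0, 2, 2, 2, 1, 2, 2, 4, 1, 4, 2, 2, 2, 0, 2, 2,
    4, 0, 0, 4, 4, 4, 4, 4, 4, 1, 1, 4, 4, 4, 4])

# number of postal codes per cluster label (index = label)
sizes = np.bincount(labels, minlength=5)
for cluster, size in enumerate(sizes):
    print('Cluster {}: {} postal codes'.format(cluster, size))
```

Label 3 appears only once in the array, so that cluster consists of a single postal code and its statistics should be read with care.
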
In [41]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = table.copy()

# join cluster labels and top venues onto the venue table by postal code
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Postal Code'), on='Postal Code')

# cleaning and adjusting dataframe 
toronto_merged = toronto_merged.dropna()
toronto_merged.reset_index(inplace=True)
toronto_merged['Cluster Labels'] = toronto_merged['Cluster Labels'].astype(int)
toronto_merged.drop(['index'], axis=1, inplace=True)

print(toronto_merged.shape)
toronto_merged.head()
(6894, 13)
Out[41]:
Postal Code Postal Code Latitude Postal Code Longitude Venue Venue Latitude Venue Longitude Venue Category Cluster Labels 1st Most Common Venue 2nd Most Common Venue 3rd Most Common Venue 4th Most Common Venue 5th Most Common Venue
0 M3A 43.753259 -79.329656 Allwyn's Bakery 43.759840 -79.324719 Caribbean Restaurant 0 Coffee Shop Pharmacy Intersection Gas Station Supermarket
1 M3A 43.753259 -79.329656 Donalda Golf & Country Club 43.752816 -79.342741 Golf Course 0 Coffee Shop Pharmacy Intersection Gas Station Supermarket
2 M3A 43.753259 -79.329656 Brookbanks Park 43.751976 -79.332140 Park 0 Coffee Shop Pharmacy Intersection Gas Station Supermarket
3 M3A 43.753259 -79.329656 Tim Hortons 43.760668 -79.326368 Café 0 Coffee Shop Pharmacy Intersection Gas Station Supermarket
4 M3A 43.753259 -79.329656 LCBO 43.757774 -79.314257 Liquor Store 0 Coffee Shop Pharmacy Intersection Gas Station Supermarket

Map with clusters by venue

In [42]:
# create map
map_venue_clusters = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Venue Latitude'], 
                                  toronto_merged['Venue Longitude'], 
                                  toronto_merged['Venue Category'], 
                                  toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_venue_clusters)
       
map_venue_clusters
Out[42]:
Make this Notebook Trusted to load map: File -> Trust Notebook

5. Investigating clusters

Let's find out whether the clusters differ in the number of venues and the number of competitors.

In [43]:
# making dataframe with info about postal codes/clusters/the most popular venues
toronto_postal_codes_clustered = toronto_merged.drop_duplicates(subset=['Postal Code'])
toronto_postal_codes_clustered = toronto_postal_codes_clustered.reset_index(drop=True)
toronto_postal_codes_clustered.set_index('Postal Code', inplace=True)
toronto_postal_codes_clustered.drop(['Venue', 'Venue Category', 'Venue Latitude', 'Venue Longitude'], axis=1, inplace=True)

toronto_postal_codes_clustered['Number of Competitors'] = toronto_venues_count['Number of Competitors']
toronto_postal_codes_clustered['Percent of Competitors'] = toronto_venues_count['Percent of Competitors']
toronto_postal_codes_clustered['Venues'] = toronto_venues_count['Venue']

print(toronto_postal_codes_clustered.shape)
toronto_postal_codes_clustered.head()
(103, 11)
Out[43]:
Postal Code Latitude Postal Code Longitude Cluster Labels 1st Most Common Venue 2nd Most Common Venue 3rd Most Common Venue 4th Most Common Venue 5th Most Common Venue Number of Competitors Percent of Competitors Venues
Postal Code
M3A 43.753259 -79.329656 0 Coffee Shop Pharmacy Intersection Gas Station Supermarket 1 2.50 40
M4A 43.725882 -79.315572 0 Coffee Shop Gym Fast Food Restaurant Middle Eastern Restaurant Shoe Store 1 1.72 58
M5A 43.654260 -79.360636 0 Coffee Shop Restaurant Café Park Pub 3 3.00 100
M6A 43.718518 -79.464763 0 Clothing Store Coffee Shop Fast Food Restaurant Restaurant Dessert Shop 3 3.00 100
M7A 43.662301 -79.389494 0 Coffee Shop Café Park Gastropub Pizza Place 2 2.00 100
In [44]:
# making table with total calculations for all clusters
# (the labels come from the groupby result so that every row is aligned with its cluster)
clusters_grouped = toronto_postal_codes_clustered.groupby('Cluster Labels')
venues_total = clusters_grouped['Venues'].sum()
clusters_total_calc = {'Cluster': list(venues_total.index),
                       'Num of Competitors': list(clusters_grouped['Number of Competitors'].sum()),
                       'Venues Total': list(venues_total),
                       'Percent of Competitors': list(clusters_grouped['Percent of Competitors'].mean())
                      }
postal_codes_clusters_total = pd.DataFrame(data=clusters_total_calc)
postal_codes_clusters_total
Out[44]:
Cluster Num of Competitors Venues Total Percent of Competitors
0 0 46 2201 1.803333
1 1 58 474 12.645556
2 2 169 2658 6.551613
3 3 1 3 33.330000
4 4 78 1558 5.363793

As we can see, the top-3 clusters by total venues are Cluster 2, Cluster 0, and Cluster 4. Cluster 2 also has the highest number of competitors, so we take a closer look at it.
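
One detail is worth keeping in mind when building a per-cluster summary like this: unique() returns labels in order of first appearance, while groupby() sorts them, so the two should not be paired column by column. A minimal sketch on invented numbers (the toy table is an assumption for illustration):

```python
import pandas as pd

# Hypothetical postal codes with cluster labels that do not appear in sorted order
toy = pd.DataFrame({
    'Cluster Labels': [2, 0, 1, 2],
    'Venues': [10, 20, 30, 40],
})

print(toy['Cluster Labels'].unique().tolist())                 # [2, 0, 1]
print(toy.groupby('Cluster Labels')['Venues'].sum().tolist())  # [20, 30, 50]

# taking the labels from the groupby result keeps label and total together
summary = (toy.groupby('Cluster Labels')['Venues'].sum()
              .rename('Venues Total')
              .reset_index())
print(summary)
```

Zipping the first two lists would attach the total of cluster 0 to label 2. Using reset_index() on the grouped result avoids this, because the label travels with its value.
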

In [45]:
toronto_postal_codes_cluster_2 = toronto_postal_codes_clustered.loc[(toronto_postal_codes_clustered['Cluster Labels'] == 2) &
                                                                   (toronto_postal_codes_clustered['Venues'] > 99)]

toronto_postal_codes_cluster_2 = toronto_postal_codes_cluster_2.sort_values(by='Number of Competitors', ascending=False)
In [46]:
# keeping postal codes in Cluster 2 with fewer than 6 competitors
toronto_postal_codes_cluster_2_potential = toronto_postal_codes_cluster_2.loc[toronto_postal_codes_cluster_2['Number of Competitors'] < 6]
In [47]:
# making dataframe with info about venues/clusters
toronto_venues_clustered = toronto_merged.copy()
# toronto_venues_clustered = toronto_venues_clustered.reset_index(drop=True)
toronto_venues_clustered.drop(['1st Most Common Venue', 
                               '2nd Most Common Venue', 
                               '3rd Most Common Venue', 
                               '4th Most Common Venue',
                               '5th Most Common Venue'], axis=1, inplace=True)

print(toronto_venues_clustered.shape)
toronto_venues_clustered.head()
(6894, 8)
Out[47]:
Postal Code Postal Code Latitude Postal Code Longitude Venue Venue Latitude Venue Longitude Venue Category Cluster Labels
0 M3A 43.753259 -79.329656 Allwyn's Bakery 43.759840 -79.324719 Caribbean Restaurant 0
1 M3A 43.753259 -79.329656 Donalda Golf & Country Club 43.752816 -79.342741 Golf Course 0
2 M3A 43.753259 -79.329656 Brookbanks Park 43.751976 -79.332140 Park 0
3 M3A 43.753259 -79.329656 Tim Hortons 43.760668 -79.326368 Café 0
4 M3A 43.753259 -79.329656 LCBO 43.757774 -79.314257 Liquor Store 0
In [48]:
# filtering out only Cluster 2
toronto_venues_cluster_2 = toronto_venues_clustered.loc[toronto_venues_clustered['Cluster Labels'] == 2]
toronto_venues_cluster_2.shape
Out[48]:
(2658, 8)
In [49]:
# create map
map_cluster_2 = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=13)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add Cluster 2 to the map
markers_colors = []
for lat_c2, lng_c2, category_c2, venue_c2 in zip(toronto_venues_cluster_2['Venue Latitude'], 
                                                 toronto_venues_cluster_2['Venue Longitude'], 
                                                 toronto_venues_cluster_2['Venue Category'], 
                                                 toronto_venues_cluster_2['Venue']):
    label_c2 = folium.Popup(str(category_c2) + ' Venue ' + str(venue_c2), parse_html=True)
    folium.CircleMarker(
        [lat_c2, lng_c2],
        radius=4,
        popup=label_c2,
        color=rainbow[0],
        fill=True,
        fill_color=rainbow[0],
        fill_opacity=0.7).add_to(map_cluster_2)

# adding potential postal codes to the map
for lat_p, lng_p, postal_code in zip(toronto_postal_codes_cluster_2_potential['Postal Code Latitude'], 
                                     toronto_postal_codes_cluster_2_potential['Postal Code Longitude'], 
                                     toronto_postal_codes_cluster_2_potential.index 
                                    ):
    label_p = folium.Popup('Postal Code: {}'.format(postal_code), parse_html=True)
    folium.Marker(
        [lat_p, lng_p],
        popup=label_p,
        parse_html=False).add_to(map_cluster_2)


# adding competitors to the map
for lat_comp, lng_comp, category_comp, venue_comp in zip(toronto_venues_competitors['Venue Latitude'], 
                                                         toronto_venues_competitors['Venue Longitude'], 
                                                         toronto_venues_competitors['Venue Category'], 
                                                         toronto_venues_competitors['Venue']):
    label_comp = folium.Popup('{}, {}'.format(category_comp, venue_comp), parse_html=True)
    folium.CircleMarker(
        [lat_comp, lng_comp],
        radius=10,
        popup=label_comp,
        color=rainbow[1],
        fill=True,
        fill_color=rainbow[1],
        fill_opacity=0.7).add_to(map_cluster_2)
    
map_cluster_2
Out[49]:
Make this Notebook Trusted to load map: File -> Trust Notebook

As the map shows, the postal codes M5B and M5C lie in an area with a high number of venues (which suggests good foot traffic) and have relatively few competitors nearby. Let's find out which boroughs these are!

In [50]:
# import JSON file with Toronto boroughs from previous task
toronto_boroughs = pd.read_json(r'toronto_boroughs.json')

# looking up the most promising boroughs by postal code
toronto_boroughs.loc[toronto_boroughs['Postal Code'].isin(['M5B', 'M5C'])]
Out[50]:
Postal Code Borough Neighborhood Latitude Longitude
9 M5B Downtown Toronto Garden District, Ryerson 43.657162 -79.378937
15 M5C Downtown Toronto St. James Town 43.651494 -79.375418
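
The map gives a visual impression of how close things are. If a number is needed, one option is the haversine (great-circle) distance between coordinates. The sketch below is not part of the analysis above: the helpers haversine_km and count_within and the 1 km radius are assumptions for illustration. The coordinates are the M5B and M5C centroids from the table above and the M3A centroid from the venue table.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    earth_radius_km = 6371.0
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))

def count_within(center, points, radius_km):
    """Count how many (lat, lon) points lie within radius_km of center."""
    return sum(
        haversine_km(center[0], center[1], lat, lon) <= radius_km
        for lat, lon in points
    )

m5b = (43.657162, -79.378937)
m5c = (43.651494, -79.375418)
m3a = (43.753259, -79.329656)

distance = haversine_km(m5b[0], m5b[1], m5c[0], m5c[1])
print(round(distance, 2))                      # roughly 0.7 km
print(count_within(m5b, [m5c, m3a], 1.0))      # only M5C is within 1 km of M5B
```

The same count_within call could be given the competitor coordinates instead of postal code centroids to count competitors around a candidate location.
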

6. Conclusion

Let's sum up the steps we took.

First, we reviewed all venue categories and manually selected the ones that compete with a bakery. Then we counted the competitors in every postal code and computed their share of all venues there.

To transform the text data we used one-hot encoding, a preliminary step for the unsupervised machine learning technique k-means clustering. As we didn't know the number of clusters in advance, we used the elbow method to choose it and then fitted the model with that number.

Clustering helped us pick the areas with the highest potential: within the cluster with the most venues, we kept the busiest postal codes that are not yet saturated with competitors.

The last step was to add three layers to the map (venues, competitors, and preferred postal codes) so that the postal codes we need are easy to spot.

Thank you!
